Added Prometheus client to get model server metrics #64
Conversation
Welcome @aish1331!
/assign @SachinVarghese
Looks great! Minor changes requested
inference_perf/main.py
metrics_client: MetricsClient | None = None
if config.metrics_client:
    if config.metrics_client.type == MetricsClientType.PROMETHEUS:
        metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)
This will simplify the client code
Suggested change:
metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)
metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client.prometheus)
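For context, the idea is that the Prometheus-specific settings live in a nested section of the metrics client config, so the client only receives what it needs. A minimal sketch of what such a config shape could look like (field names here are assumptions for illustration, not the PR's actual models):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MetricsClientType(str, Enum):
    PROMETHEUS = "prometheus"


@dataclass
class PrometheusClientConfig:
    url: str = "http://localhost:9090"  # assumed default address
    scrape_interval: int = 15  # seconds


@dataclass
class MetricsClientConfig:
    type: MetricsClientType = MetricsClientType.PROMETHEUS
    # Nesting the Prometheus settings means the client can be constructed
    # with just `config.metrics_client.prometheus` instead of the whole config.
    prometheus: Optional[PrometheusClientConfig] = None
```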
done
# assumption: all metrics clients have metrics exported in Prometheus format
raise NotImplementedError

@abstractmethod
methods repeated?
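The snippet above comes from the abstract metrics-client interface, and the question is whether the same abstract method appears twice. For reference, a minimal sketch of the intended pattern, with method names and signatures assumed for illustration:

```python
from abc import ABC, abstractmethod
from typing import Optional


class MetricsClient(ABC):
    """Base interface for clients that fetch model server metrics."""

    # assumption: all metrics clients have metrics exported in Prometheus format
    @abstractmethod
    def collect_model_server_metrics(self, runtime_parameters) -> Optional[dict]:
        """Query the metrics backend and return a summary, or None on failure."""
        raise NotImplementedError

    @abstractmethod
    def wait(self) -> None:
        """Block until the backend has scraped the final datapoints."""
        raise NotImplementedError
```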
inference_perf/client/mock_client.py
        time_per_request=3,
    )
)
print("Processing request - " + str(payload.data))
we can keep the sleep statement here to mock the processing
And let's retain the behaviour of collecting request metrics in this client
done
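Taken together, the two comments suggest the mock client should keep a short sleep to simulate processing and still record a request metric per call. A rough sketch under those assumptions (class and field names are illustrative, not the PR's exact code):

```python
import asyncio
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestMetric:
    prompt_tokens: int
    output_tokens: int
    time_per_request: float


@dataclass
class MockModelServerClient:
    request_metrics: List[RequestMetric] = field(default_factory=list)

    async def process_request(self, payload) -> None:
        print("Processing request - " + str(payload.data))
        # Keep the sleep to mock the model server doing work.
        await asyncio.sleep(3)
        # Retain the behaviour of collecting request metrics in this client.
        self.request_metrics.append(
            RequestMetric(prompt_tokens=10, output_tokens=10, time_per_request=3)
        )

    def get_request_metrics(self) -> List[RequestMetric]:
        return self.request_metrics
```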
config.yml
@@ -11,3 +11,8 @@ tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
data:
  type: shareGPT
# metrics_client:
Let's uncomment these and let the test runner fall back in case the Prometheus server is unreachable
Done. The test runner prints the request summary, but it also prints all the error logs for the connection failures of each query.
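One way to keep that fallback without an error log per query is to probe the Prometheus endpoint once up front and skip the metrics client for the rest of the run. A sketch of that idea, assuming a plain HTTP readiness check (the helper name and URL handling are hypothetical, not the PR's implementation):

```python
import logging

import requests

logger = logging.getLogger(__name__)


def prometheus_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Probe Prometheus once; log a single warning if it is unreachable."""
    try:
        # /-/ready is Prometheus' standard readiness endpoint.
        response = requests.get(f"{base_url}/-/ready", timeout=timeout)
        return response.status_code == 200
    except requests.RequestException as exc:
        logger.warning(
            "Prometheus unreachable at %s (%s); falling back to request metrics",
            base_url,
            exc,
        )
        return False
```

If the probe fails, the runner can skip constructing the metrics client entirely and print only the default request summary.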
request_metrics = runtime_parameters.model_server_client.get_request_metrics()
if len(request_metrics) > 0:
    total_prompt_tokens = sum([x.prompt_tokens for x in request_metrics])
    total_output_tokens = sum([x.output_tokens for x in request_metrics])
    if runtime_parameters.duration > 0:
        prompt_tokens_per_second = total_prompt_tokens / runtime_parameters.duration
        output_tokens_per_second = total_output_tokens / runtime_parameters.duration
        requests_per_second = len(request_metrics) / runtime_parameters.duration
    else:
        prompt_tokens_per_second = 0.0
        output_tokens_per_second = 0.0
        requests_per_second = 0.0

    request_summary = ModelServerMetrics(
        total_requests=len(request_metrics),
        requests_per_second=requests_per_second,
        prompt_tokens_per_second=prompt_tokens_per_second,
        output_tokens_per_second=output_tokens_per_second,
        avg_prompt_tokens=int(statistics.mean([x.prompt_tokens for x in request_metrics])),
        avg_output_tokens=int(statistics.mean([x.output_tokens for x in request_metrics])),
        avg_request_latency=statistics.mean([x.time_per_request for x in request_metrics]),
        median_request_latency=statistics.median([x.time_per_request for x in request_metrics]),
        p90_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=10)[8],
        p99_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=100)[98],
        avg_time_to_first_token=0.0,
        median_time_to_first_token=0.0,
        p90_time_to_first_token=0.0,
        p99_time_to_first_token=0.0,
        median_time_per_output_token=0.0,
        p90_time_per_output_token=0.0,
        p99_time_per_output_token=0.0,
        avg_time_per_output_token=0.0,
        avg_queue_length=0,
    )
    print("-" * 50)
    print("Request Summary")
    print("-" * 50)
    for field_name, value in request_summary:
        print(f"{field_name}: {value}")
else:
    print("Report generation failed - no request metrics collected")

if self.metrics_client is not None:
    print("-" * 50)
    print("Metrics Client Summary")
    print("-" * 50)
    metric_client_summary = self.metrics_client.collect_model_server_metrics(runtime_parameters)
    if metric_client_summary is not None:
        for field_name, value in metric_client_summary:
            print(f"{field_name}: {value}")
    else:
        print("Report generation failed - no metrics client summary collected")
Can we move the calculation-relevant parts of the code here to the base class so that it can be reused by various report types?
done
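For illustration, a sketch of what moving the calculation into a shared base report generator could look like, so different report types reuse the same summarisation (the class and method names are assumptions; the math mirrors the snippet above):

```python
import statistics
from typing import List, Optional


class ReportGenerator:
    """Base class holding the summary math shared by all report types."""

    def summarize_requests(self, request_metrics: List, duration: float) -> Optional[dict]:
        if not request_metrics:
            return None

        latencies = [m.time_per_request for m in request_metrics]
        prompt_tokens = [m.prompt_tokens for m in request_metrics]
        output_tokens = [m.output_tokens for m in request_metrics]

        def rate(total: float) -> float:
            # Guard against a zero duration, as in the original code.
            return total / duration if duration > 0 else 0.0

        return {
            "total_requests": len(request_metrics),
            "requests_per_second": rate(len(request_metrics)),
            "prompt_tokens_per_second": rate(sum(prompt_tokens)),
            "output_tokens_per_second": rate(sum(output_tokens)),
            "avg_prompt_tokens": int(statistics.mean(prompt_tokens)),
            "avg_output_tokens": int(statistics.mean(output_tokens)),
            "avg_request_latency": statistics.mean(latencies),
            "median_request_latency": statistics.median(latencies),
            "p90_request_latency": statistics.quantiles(latencies, n=10)[8],
            "p99_request_latency": statistics.quantiles(latencies, n=100)[98],
        }
```

Concrete report types would then only decide how to render the returned summary.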
Force-pushed from 802ed3d to bcf2ee6
Force-pushed from db44434 to 034cdba
/ok-to-test
# Wait for the metrics to be ready
metrics_client.wait()

end_time = time.time()
We shouldn't include the metrics wait time above in the total test run time if all the requests have finished already, since that can skew the test results when we are waiting for the metrics scrape interval.
Yes, the duration of the test should be calculated correctly. Also consider that this could be a multi-stage run, so the duration needs to be calculated for each stage.
Good point!
To tackle that, I am now separately calculating the perf run duration here and have moved the wait logic to the Prometheus client. This duration is used for computing the default request summary. I now pass the benchmark start time to the metrics client in the perf runtime params; we call the wait function and then calculate the end time. This way, the Prometheus query runs with start timestamp = benchmark start time and end timestamp = benchmark end + wait(), ensuring we get all the metrics.
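A rough sketch of that timing split, with the function names and scrape-interval handling assumed rather than taken from the PR:

```python
import time


def run_benchmark(send_requests, metrics_client=None, scrape_interval: float = 15.0):
    start_time = time.time()
    send_requests()  # drive the actual benchmark traffic
    end_time = time.time()

    # Duration used for the default request summary: request phase only,
    # so waiting for the metrics scrape does not skew throughput numbers.
    perf_duration = end_time - start_time

    query_window = None
    if metrics_client is not None:
        # Wait for the last scrape so Prometheus has all datapoints,
        # then query from benchmark start to benchmark end + wait.
        time.sleep(scrape_interval)
        query_window = (start_time, time.time())

    return perf_duration, query_window
```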
Also, if you can run
@aish1331 can you rebase, address the pending comments, and create any follow-up issues so that we can get this work in?
@Bslabe123 FYI, I am creating separate report files for the request summary and the summary generated by the metrics client.
Added an issue to modify reportgen for multi-stage runs: #70
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: aish1331, SachinVarghese
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Issue: #17
Description
Added a Prometheus client that connects to a Prometheus instance to fetch metrics for the benchmark. This assumes Prometheus already has the model server metrics.
Engine tested: vLLM
Metrics fetched from the vLLM server:
- Queue Size
- TTFT
- TPOT
- Input Token Count
- Output Token Count
- Successful Request Count
- Request Latency
If metrics_client is not set in config.yml, report generation falls back to the benchmark request metrics.
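For reference, these are the kinds of queries such a client could issue against the Prometheus HTTP API for vLLM's exported metrics. The metric names below are the ones vLLM commonly exposes, but the exact set and query shapes are assumptions, not necessarily what this PR uses:

```python
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed address

# Example query templates over a benchmark window of {d} seconds.
QUERIES = {
    "queue_size": "avg_over_time(vllm:num_requests_waiting[{d}s])",
    "prompt_tokens_per_second": "rate(vllm:prompt_tokens_total[{d}s])",
    "output_tokens_per_second": "rate(vllm:generation_tokens_total[{d}s])",
    "successful_requests": "increase(vllm:request_success_total[{d}s])",
    "p90_ttft": "histogram_quantile(0.9, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[{d}s])))",
    "p90_tpot": "histogram_quantile(0.9, sum by (le) (rate(vllm:time_per_output_token_seconds_bucket[{d}s])))",
    "avg_request_latency": (
        "rate(vllm:e2e_request_latency_seconds_sum[{d}s])"
        " / rate(vllm:e2e_request_latency_seconds_count[{d}s])"
    ),
}


def query_metric(promql: str, end_timestamp: float) -> list:
    """Run an instant query against Prometheus at the benchmark end timestamp."""
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql, "time": end_timestamp},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["data"]["result"]
```

Each template would be formatted with the benchmark duration in seconds, e.g. `QUERIES["queue_size"].format(d=int(duration))`, and evaluated at the end timestamp computed after `wait()`.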
TODO
Fetch KV cache usage metric from vLLM server via Prometheus.
Following are the logs with the report generated for a 2-stage benchmark run.
As of this PR, metrics for the complete run are reported, i.e. metrics for all the stages are aggregated.
Linter and other Validation Checks
Output of the `make validate` command: